Graphical EDA I: continuous data

Team 1

04 January 2021

Structure

1. Examining Continuous Variables

2. Looking for Structure: Dependency Relationships and Associations

3. Investigating Multivariate Continuous Data

1. Examining Continuous Variables

Exercise 1. Galaxies

Solution

a) Histogram

qplot chooses the number of histogram bins automatically (30 by default), and the resulting histogram shows a bimodal distribution with a few isolated values in both tails.

library(MASS)
library(ggplot2)
library(mclust)
## Package 'mclust' version 5.4.7
## Type 'citation("mclust")' for citing this R package in publications.
data(galaxies)
galaxies <- as.data.frame(galaxies)
names(galaxies) <- 'Velocity'
par(fig=c(0,1,0,1),new=T)
qplot(galaxies$Velocity) +
  labs(title='Histogram of Galaxy Velocity',
       x='Velocity of Galaxy',
       y='Frequency')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Density estimate

The density plot of the fitted model shows three distinct superclusters, with the far right tail being less distinct.

library(mclust)
mod <- Mclust(galaxies$Velocity)
par(fig=c(0,1,0,1),new=T)
plot(mod,what="density")

c) Different plots:

In order to present all the information, I think we need at least 5 different plots to reveal everything this data set can provide. A boxplot, histogram, rug plot, and dot plot each convey different information.
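As a sketch of how those alternatives look side by side, using base graphics (the binwidth and layout here are arbitrary choices):

```r
# Four complementary views of the galaxy velocities.
vel <- MASS::galaxies
op <- par(mfrow = c(2, 2))
boxplot(vel, horizontal = TRUE, main = "Boxplot")
hist(vel, breaks = 20, main = "Histogram", xlab = "Velocity")
plot(density(vel), main = "Density with rug")
rug(vel)  # a rug marks every individual observation
stripchart(vel, method = "stack", main = "Dotplot")
par(op)
```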

Exercise 2. Boston housing

Solution

a) The histograms of the different attributes:

There are several different histogram shapes, each telling a separate story. The default binwidths, dividing each variable’s range by 30, have been used; other choices could reveal more information and be more interpretable. It is interesting that the vertical scales vary from maxima of around 40 to over 400. Plotting the histograms individually and choosing binwidths and scale limits are the main decisions to be made.

b) For the ZN and BLACK variables, boxplots might be better because they make such efficient use of the available space.

Exercise 3. Student survey

Solution

data(survey, package="MASS")
par(fig=c(0,1,0,1),new=T)
hist(survey$Height,
     xlab = 'Height',
     main = "Histogram of Students' Heights",
     ylab = 'Frequency')

plot(density(survey$Height, na.rm = TRUE), main = 'Density of Student Height')

b) Examination of national survey data on young adults shows that the separation between the distributions of men’s and women’s heights is not wide enough to produce bimodality.

Exercise 4. Movie lengths

Solution

library(ggplot2)
data(movies, package="ggplot2movies")
par(fig=c(0,1,0,1),new=T)
hist(movies$year[movies$length == 90 | movies$length == 7],
     xlab = 'Year',
     main = 'Years of movies with length 7 or 90 minutes',
     ylab = 'Nr.')

a) The histogram shows that movies of exactly 7 or 90 minutes in length occur in both periods: before 1980 and after 1980.

  1. In order to classify a movie as short or as long, I think density estimation is a good idea. We can set a length threshold below which a movie is classified as short.
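A minimal sketch of such a rule (the helper and the 60-minute cutoff are my own assumptions, not from the exercise):

```r
# Hypothetical classifier: short vs long by a length threshold in minutes.
classify_length <- function(length_min, cutoff = 60) {
  ifelse(length_min <= cutoff, "short", "long")
}
classify_length(c(7, 90))  # "short" "long"
```

With the real data this would be applied as `classify_length(movies$length)`, after inspecting the density estimate to pick a sensible cutoff.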

Exercise 5. Zuni educational funding

Solution

a) Histogram or boxplot?

table

library(lawstat)
data(zuni, package="lawstat")
par(fig=c(0,1,0,1),new=T)
hist(zuni$Revenue,
     xlab = 'Revenue',
     main = 'Revenue Histogram',
     ylab = 'Nr.')

I prefer a histogram for showing the lowest and the highest 5% of revenues.
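The 5% tails can also be picked out directly with quantiles; a sketch on a toy vector (with the real data, substitute `zuni$Revenue` for `x`):

```r
# Tag the lowest and highest 5% of values via quantiles.
x <- 1:100
cuts <- quantile(x, probs = c(0.05, 0.95))
low5  <- x[x <= cuts[1]]   # 1 2 3 4 5
high5 <- x[x >= cuts[2]]   # 96 97 98 99 100
```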

b) Density estimation:

mod <- Mclust(zuni$Revenue)
par(fig=c(0,1,0,1),new=T)
plot(mod,what="density")

Exercise 6. Non-detectable

Solution

There is no “h39b.W1” variable in the CHAIN data because it has been renamed to “log_virus”. In both cases I would use a histogram, because it makes the counts easy to see.

library(mi)
## Loading required package: Matrix
## Loading required package: stats4
## mi (Version 1.0, packaged: 2015-04-16 14:03:10 UTC; goodrich)
## mi  Copyright (C) 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015 Trustees of Columbia University
## This program comes with ABSOLUTELY NO WARRANTY.
## This is free software, and you are welcome to redistribute it
## under the General Public License version 2 or later.
## Execute RShowDoc('COPYING') for details.
library(ggplot2)
data(CHAIN)
par(fig=c(0,1,0,1),new=T)
hist(CHAIN$log_virus,
     xlab = 'log viral load',
     main = 'Histogram of virus cases with 0s',
     ylab = 'Nr.')

library(mi)
library(ggplot2)
data(CHAIN)
par(fig=c(0,1,0,1),new=T)
hist(CHAIN$log_virus[CHAIN$log_virus != 0],
     xlab = 'log viral load',
     main = 'Histogram of virus cases without 0s',
     ylab = 'Nr.')

Exercise 7. Diamonds

Solution

a) Diamond Weight

A diamond’s weight can be found in the “carat” attribute. Let’s plot it against price:

library(ggplot2)
data(diamonds)
ggplot(diamonds, aes(x=carat, y=price)) + geom_point()

I wanted to weigh a diamond’s weight against its price. Apparently the most expensive diamonds weigh between 1.5 and 3 carats, while the cheapest diamonds tend to weigh the least.

b) Distribution of prices

data(diamonds, package="ggplot2")
par(fig=c(0,1,0,1),new=T)
hist(diamonds$price,
     xlab = 'Price',
     main = 'Histogram of Diamonds Prices',
     ylab = 'Frequency')

For the distribution of diamond prices, I chose a histogram. It makes it easy to see that the most expensive diamonds are the fewest. One factor behind this is that such diamonds are very rare and hard to find; another is that they require more work than the others.

2. Looking for Structure: Dependency Relationships and Associations

Exercise 1. Movie ratings

Figure 5.7

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data(movies, package = "ggplot2movies")
 ggplot(movies, aes(votes, rating)) + geom_point() + ylim(1, 10)

  1. Excluding all films with fewer than 100 votes:
library(ggplot2)
library(dplyr)

data(movies, package = "ggplot2movies")

filtered <- filter(movies, votes >= 100)
ggplot(filtered, aes(votes, rating)) + geom_point() + ylim(1, 10) 

  1. Excluding films with average rating greater than 9 and also the ones that have more than 100000 votes:
library(ggplot2)
library(dplyr)

data(movies, package = "ggplot2movies")
filtered2 <- filter(movies, rating <= 9 & votes <= 100000)

ggplot(filtered2, aes(votes, rating)) + geom_point() + ylim(1, 10) 
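Note that excluding films with rating greater than 9 *and also* those with more than 100000 votes means keeping rows that satisfy both `rating <= 9` and `votes <= 100000` (De Morgan). A toy check of that filter logic on a hypothetical data frame:

```r
# Three toy films: only the first passes both conditions.
df <- data.frame(rating = c(5, 9.5, 8), votes = c(10, 50, 2e5))
kept <- subset(df, rating <= 9 & votes <= 1e5)
nrow(kept)  # 1
```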

Exercise 2. Meta analysis (Olkin95 dataset)

  1. Number of observations in each experimental group (n.e, called n.exp in recent versions of the meta package) against the corresponding number of observations in each control group (n.c, here n.cont):
library(meta)
## Loading 'meta' package (version 4.15-1).
## Type 'help(meta)' for a brief overview.
data(Olkin1995)

ggplot(Olkin1995, aes(n.exp, n.cont)) + geom_point()

  1. Restricting the scatterplot to those studies with fewer than 100 patients in each group:
library(meta)
data(Olkin1995)
ggplot(Olkin1995, aes(n.exp, n.cont)) + geom_point() + ylim(1, 100) + xlim(1,100)

With the plot restricted to the smaller studies, we can now more easily see structure (and some outliers) that was not visible before.

Exercise 3. Zuni

  1. Scatterplot of average revenue per pupil (Revenue) against the corresponding number of pupils (Mem):
library(lawstat)
data(zuni)
#print(zuni)
ggplot(zuni, aes(Revenue,Mem)) + geom_point()

Observations:

  1. Plotting against log of the number of pupils:
library(lawstat)
data(zuni)
ggplot(zuni, aes(Revenue,log(Mem))) + geom_point()

Logging revenue per pupil as well adds no further insight to the scatterplot, as shown below:

library(lawstat)
data(zuni)
ggplot(zuni, aes(log(Revenue),log(Mem))) + geom_point()

Exercise 4. Pearson heights

  1. Scatterplot of the heights:
data(father.son, package="UsingR")
ggplot(father.son, aes(fheight, sheight)) + geom_point()

  1. Including both points and highest density regions:
library(hdrcde)
## This is hdrcde 3.3
data(father.son, package="UsingR")

par(mar=c(3.1, 4.1, 1.1, 2.1))
with(father.son,hdr.boxplot.2d(fheight, sheight, show.points=TRUE, prob=c(0.01,0.05,0.5,0.75)))

Note: mar is a numeric vector of length 4 which sets the margin sizes in the following order: bottom, left, top, and right. The default is c(5.1, 4.1, 4.1, 2.1).

  1. Fitting a linear model to the data and a loess smooth:
data(father.son, package="UsingR")

ggplot(father.son, aes(fheight, sheight)) + geom_point() + geom_smooth(method="lm", colour="red") + geom_abline(slope=1, intercept=0)
## `geom_smooth()` using formula 'y ~ x'

A nonlinear model is not necessary, as the two curves are almost identical:

data(father.son, package="UsingR")
ggplot(father.son, aes(fheight, sheight)) + geom_point() +
geom_smooth(method="lm", colour="red", se=FALSE) +
stat_smooth()
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Exercise 5. Bank discrimination

A subset of Roberts’ bank sex discrimination dataset from 1979 is available in the package Sleuth2 under the name case1202.

  1. Scatterplot matrix of the three variables: Senior, Age and Exper:
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
data(case1202, package="Sleuth2")
#summary(case1202)
par(mar=c(1.1, 1.1, 1.1, 1.1))
#spm(select(case1202, c(4:5,7)), diagonal="histogram", smoother=FALSE, reg.line=FALSE) #groups=bank$Status)
ggpairs(case1202[,c(4:5, 7)], title="Bank discrimination", diag=list(continuous='density'), axisLabels='none')

  1. Scatterplots involving seniority do not have the structure of the scatterplot of experience against age because:

Exercise 6. Cars

Figure 5.8.

data(Cars93, package="MASS")
#print(Cars93)
ggplot(Cars93, aes(Weight, MPG.city)) + geom_point() +
geom_smooth(colour="green") + ylim(0,50)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Note: fuel economy decreases with weight quite quickly initially and then more slowly.

  1. Plotting 1/MPG.City (litres per 100 km instead of miles per gallon) against Horsepower:
data(Cars93, package="MASS")

ggplot(Cars93, aes((1/MPG.city), Horsepower)) + geom_point() + geom_smooth(colour="green") 
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Note: since 1/MPG.city measures fuel consumption, the plot shows that consumption rises with horsepower: more powerful cars use more fuel per mile. The outliers are: Midsize, Sporty or Large cars with horsepower above 200 and MPG.city of 16, 18, 19 or 22; and Small or Compact cars with MPG.city of 19, 20 or 25.
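For reference, converting US miles per gallon into the litres-per-100-km scale mentioned in the exercise is a simple unit calculation (conversion factors: 1 mile = 1.609344 km, 1 US gallon = 3.785411784 litres):

```r
# US mpg -> litres per 100 km.
mpg_to_l100km <- function(mpg) 100 * 3.785411784 / (1.609344 * mpg)
mpg_to_l100km(23.5)  # roughly 10 L/100 km
```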


Exercise 7. Leaves

The leafshape dataset in the DAAG package includes three measurements on each leaf (length, width, petiole) and the logarithms of the three measurements.

  1. Sploms for the two sets of three variables
library(GGally)
library(ggplot2)
data(leafshape, package="DAAG")
#summary(leafshape)
par(mar=c(1.1, 1.1, 1.1, 1.1))

ggpairs(leafshape[,c(1:3)], title="Standard Leaf measurements", diag=list(continuous='density'), axisLabels='none')


ggpairs(leafshape[,c(7:5)], title="Logarithmic Leaf measurements", diag=list(continuous='density'), axisLabels='none')


3. Investigating Multivariate Continuous Data

Exercise 1. Swiss

Solution

library('GGally')

# (a)
ggparcoord(data = swiss, columns=c(1:6), scale="uniminmax", alphaLines=0.2) + 
  xlab("") + ylab("")

(b) There might be some outliers, depending on the variable considered:

(c) The variable Catholic looks bimodal, with one mode at the lowest end of its range (0 - 0.25 on the uniminmax scale) and one at the highest (0.8 - 1.0). So in Switzerland a province usually has either a large Catholic majority or a large non-Catholic majority.
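This bimodality can be checked numerically (swiss ships with base R, and Catholic is a percentage; the 25% and 80% cut-points below are arbitrary choices for illustration):

```r
# Count provinces in low / middle / high Catholic bands.
data(swiss)
bands <- cut(swiss$Catholic, breaks = c(0, 25, 80, 100))
table(bands)  # most provinces fall in the two outer bands
```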

# (d)
swiss1 <- within(swiss, 
                 catholics_level <- factor(ifelse(Catholic > 80, 'High', 'Low')))

ggparcoord(data = swiss1[order(swiss1$Catholic),], columns=c(1:6), scale="uniminmax", 
           groupColumn="catholics_level", alphaLines=0.5) + 
  xlab("") + ylab("") + 
  theme(legend.position = "none")

(d) The provinces with a high proportion of Catholics appear to have a higher Fertility index and lower Examination (i.e. % draftees receiving the highest mark on the army examination) and Education. The Infant.Mortality variable does not look much affected by whether a province has a Catholic or non-Catholic majority.

Exercise 2. Pottery

Solution

# (a)
library('HSAUR2');
## Loading required package: tools
ggparcoord(data = pottery, columns=c(1:9), scale="uniminmax", alphaLines=0.2) + 
  xlab("") + ylab("")

(a) In this pcp we can see several details:

# (b)
# print MgO column as sorted, and select the right threshold (which is 1.0)
sort(pottery$MgO)
##  [1] 0.53 0.56 0.60 0.63 0.67 0.67 0.67 0.67 0.68 0.72 1.50 1.56 1.62 1.62 1.63
## [16] 1.65 1.67 1.81 1.82 1.83 1.83 1.86 1.92 1.94 1.94 1.99 2.00 2.05 2.06 2.06
## [31] 2.33 3.43 3.47 3.77 3.88 3.94 4.26 4.30 4.52 5.34 5.51 5.64 5.69 5.91 7.23
pottery1 <- within(pottery, 
                   mgo_level <- factor(ifelse(MgO < 1, 'Low', 'High')))  # use the 1.0 threshold here

ggparcoord(data = pottery1[order(pottery1$MgO),], columns=c(1:9), scale="uniminmax", 
           groupColumn="mgo_level", alphaLines=0.5) + 
  xlab("") + ylab("") + 
  theme(legend.position = "none")

(b) On the other variables, the cases with low MgO also have lower Fe2O3, CaO, Na2O, K2O, MnO than the other cases. Also, some of these cases have higher values for Al2O3 and TiO2.

# (c)
ggparcoord(data = pottery, columns=c(1:9), scale="uniminmax", 
           groupColumn="kiln", alphaLines=0.5) + 
  xlab("") + ylab("") + geom_line(size=0.7)

(c) In this pcp we can see some differences between different kilns as follows:

Exercise 3. Olive oils

Solution

# (a)
library('pdfCluster')
## pdfCluster 1.0-3
## 
## Attaching package: 'pdfCluster'
## The following object is masked from 'package:dplyr':
## 
##     groups
data("oliveoil")
ggparcoord(data = oliveoil, columns=c(3:10), scale="uniminmax", alphaLines=0.2) + 
  xlab("") + ylab("")

(a) Some of the observed features in this pcp:

# (b)
ggparcoord(data = oliveoil, columns=c(3:10), scale="uniminmax", 
           groupColumn="region", alphaLines=0.7) + 
  xlab("") + ylab("") 

(b) In this pcp we can find that:

(c) For the scatterplot matrix down below:

While for a pcp:

ggpairs(oliveoil[,c(3:10)], title="Olive acids", diag=list(continuous='density'), axisLabels='none')

Exercise 4. Cars

Solution

data(Cars93, package="MASS")
col_indices = which(names(Cars93)%in%c('Price', 'MPG.city', 'MPG.highway', 'Horsepower', 'RPM', 'Length', 'Width', 'Turn.circle', 'Weight'))
ggparcoord(data = Cars93, columns=col_indices, scale="uniminmax", alphaLines=0.7) + 
  xlab("") + ylab("") 

(a) In this plot we could conclude the following:

(b) I would plot a pcp (down below), and here we could observe some differences between USA and non-USA cars:

# (b)
ggparcoord(data = Cars93, columns=col_indices, scale="uniminmax", 
           groupColumn="Origin", alphaLines=0.7) + 
  xlab("") + ylab("")

(c) Yes, a pcp with uniminmax scaling is informative, since we were able to extract insights from it in (a) and (b). Below is the same pcp with a standard scale applied (subtracting the mean and dividing by the standard deviation on each axis) and the observations categorized by the number of Cylinders. Here we can also see that:

# (c)
ggparcoord(data = Cars93, columns=col_indices,
           groupColumn="Cylinders", alphaLines=0.7) + 
  xlab("") + ylab("") + geom_line(size=0.75)

Exercise 5. Bodyfat

Solution

data(bodyfat, package="MMST")
ggparcoord(data = bodyfat, columns=1:15, scale="uniminmax", alphaLines=0.7) + 
  xlab("") + ylab("") 

(a) There is clearly one outlier that reaches the maximum value 1.0 on the uniminmax scale for many variables (bodyfat, weight, neck, chest, abdomen, hip, thigh, knee, biceps, wrist). Since he is not the tallest man in the sample (judging by his height), I would say this outlier is not an athlete but rather a person with serious obesity problems.

In addition, there are 2 more outliers on the ankle axis, who may also be outliers on the hip, abdomen and chest measurements.

(b) The height variable has many points of concentration, as if it were a categorical variable. Perhaps the heights were recorded with limited precision rather than measured exactly. We can quickly check this below:

# by looking at some of the observations, actually it looks like the data contains estimations to the closest quarter float (0.00/0.25/0.50/0.75).
bodyfat$height[1:10]
##  [1] 67.75 72.25 66.25 72.25 71.25 74.75 69.75 72.50 74.00 73.50

So the reason for this “categorical behaviour” is indeed the limited precision: the heights are recorded to the nearest quarter inch.
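The quarter-inch conjecture can be verified on the ten values printed above (with the full data, replace the vector with `bodyfat$height`):

```r
# A value is a multiple of 0.25 exactly when 4 * value is an integer.
h <- c(67.75, 72.25, 66.25, 72.25, 71.25, 74.75, 69.75, 72.50, 74.00, 73.50)
all(h * 4 == round(h * 4))  # TRUE
```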

(c) As seen in the pcp, as the density increases, the bodyfat decreases, and vice versa. So there is clearly a negative correlation between these 2 variables. This is also suggested by the intersection of all profiles in (what appears to be) a single point.
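In fact, body-fat percentages in datasets of this kind are typically derived from density via Siri’s equation, which makes the negative association exact (a sketch; whether this dataset used Siri’s equation is an assumption):

```r
# Siri's equation: %bodyfat = 495 / density - 450,
# a monotone decreasing function of density.
siri <- function(density) 495 / density - 450
siri(c(1.05, 1.10))  # higher density gives lower bodyfat
```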

(d) Yes, the ordering of the variables can affect the pcp display. For this specific dataset we can try ordering the variables by their medians, so the measurement categories run from smallest to largest and give a better overview of the data.

medians = apply(bodyfat[, 1:15], 2, median, na.rm=TRUE)
ordered_medians_indexes = order(medians)
ggparcoord(data = bodyfat, alphaLines=0.3,
           scale="globalminmax", order=ordered_medians_indexes) + coord_flip()

Here we can see, for example, that the wrist has the smallest measurement of all the body measurements and that the chest is the largest part of a man’s body.

Exercise 6. Exam marks

Solution

(a) I think a good pcp to present to others would be the following, with globalminmax scaling and y limits of 0-100, from which we can deduce that:

library('SMPracticals')
## Loading required package: ellipse
## 
## Attaching package: 'ellipse'
## The following object is masked from 'package:graphics':
## 
##     pairs
## 
## Attaching package: 'SMPracticals'
## The following object is masked from 'package:HSAUR2':
## 
##     smoking
## The following object is masked from 'package:GGally':
## 
##     pigs
## The following objects are masked from 'package:MASS':
## 
##     cement, forbes, leuk, shuttle
ggparcoord(data = mathmarks, columns=1:5, scale="globalminmax", alphaLines=0.7) + 
  xlab("") + ylab("") + coord_cartesian(ylim=c(0,100))

We could try to plot a pcp by highlighting the students who scored better on the Vectors exam. In this pcp we can observe that:

mathmarks1 <- within(mathmarks, 
                     vectors_perf <- factor(ifelse(vectors > 60, 'Good', 'Bad')))
ggparcoord(data = mathmarks1, columns=1:5, alphaLines=0.7, scale="globalminmax", 
           groupColumn='vectors_perf') + 
  xlab("") + ylab("") + theme(legend.position = "none") + coord_cartesian(ylim=c(0,100))

# (b)
ggparcoord(data = mathmarks, columns=1:5, scale="globalminmax", alphaLines=0.1, boxplot=TRUE) + 
  xlab("") + ylab("") + coord_cartesian(ylim=c(0,100))

(b) The boxplots make it easier to see the maximum, minimum and median values and the outliers for each exam. Looking at this plot, we cannot tell for sure that Mechanics and Vectors were closed-book exams. We can say for sure that the Statistics exam (an open-book exam) has the lowest median mark of all the subjects, which somewhat contradicts the idea of it being open-book.

However, looking at the minimum value (~3 points) of the Mechanics exam, we can say it is the lowest among all subjects, which is consistent with its having been a closed-book exam.

Regarding the polygonal lines, I do not think it is necessary to draw them in a boxplot pcp in this specific scenario: even with the alphaLines option they just fill the space without adding any useful information, as can be seen above.

Exercise 7. Wine

Solution

(a) Using the following pcps, which highlight the profiles of each wine class in turn (red - Barbera, green - Barolo, blue - Grignolino), we can easily distinguish which variables help to classify the different wines:

library('gridExtra')
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
data(wine, package="MMST")
a = ggparcoord(wine, columns=1:13, groupColumn="class",scale="uniminmax") + xlab("") + ylab("") +
  theme(legend.position = "none") +
  scale_colour_manual(values = c("red","grey", "grey")) + coord_flip()
b = ggparcoord(wine, columns=1:13, groupColumn="class",scale="uniminmax") + xlab("") + ylab("") +
  theme(legend.position = "none") +
  scale_colour_manual(values = c("grey","green", "grey")) + coord_flip()
c = ggparcoord(wine, columns=1:13, groupColumn="class",scale="uniminmax") + xlab("") + ylab("") +
  theme(legend.position = "none") +
  scale_colour_manual(values = c("grey","grey", "blue")) + coord_flip()
grid.arrange(a, b, c, nrow=1)

(b) There are certainly some outliers in the data: see, for example, the extreme high/low values in the blue pcp on the MalicAcid, Ash, Flav and Hue variables.

(c) Yes, there might be some evidence that these classes have subgroups of wines inside them. Some examples would be:

Exercise 8. Boston housing

Solution

data(Boston, package="MASS")

hcluster = hclust(dist(Boston), method='ward.D2')
clu4 = cutree(hcluster, k=4)
clus = factor(clu4)
boston1 = cbind(Boston, clus)

# (a)
ggparcoord(boston1, columns=1:14, groupColumn="clus", scale="uniminmax") + 
  xlab("") + ylab("")
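The clustering step above (hclust with Ward linkage, then cutree) can be sanity-checked on synthetic data, where two well-separated blobs should be recovered exactly:

```r
# Two Gaussian blobs, 10 points each, 8 standard deviations apart.
set.seed(1)
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 8), ncol = 2))
hc  <- hclust(dist(toy), method = "ward.D2")
grp <- cutree(hc, k = 2)
table(grp)  # 10 points in each group
```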

# (b)
a = ggparcoord(boston1[which(boston1$clus == 1),], columns=1:14, scale="uniminmax",
               mapping=aes(color='#f5aca7')) + 
  xlab("") + ylab("") +
  theme(legend.position = "none") +
  scale_colour_manual(values = c("#f5aca7"))
b = ggparcoord(boston1[which(boston1$clus == 2),], columns=1:14, scale="uniminmax",
               mapping=aes(color='#a8c75a')) + 
  xlab("") + ylab("") +
  theme(legend.position = "none") +
  scale_colour_manual(values = c("#a8c75a"))
c = ggparcoord(boston1[which(boston1$clus == 3),], columns=1:14, scale="uniminmax",
               mapping=aes(color='#1ec5c9')) + 
  xlab("") + ylab("") +
  theme(legend.position = "none") +
  scale_colour_manual(values = c("#1ec5c9"))
d = ggparcoord(boston1[which(boston1$clus == 4),], columns=1:14, scale="uniminmax", 
               mapping=aes(color='#cc8afd')) + 
  xlab("") + ylab("") +
  theme(legend.position = "none") +
  scale_colour_manual(values = c("#cc8afd"))
grid.arrange(a, b, c, d)

# (c)
a = ggparcoord(rbind(boston1[which(boston1$clus != 1),], boston1[which(boston1$clus == 1),]),
               columns=1:14, groupColumn="clus", scale="uniminmax") +
  xlab("") + ylab("") +
  theme(legend.position = "none") +
  scale_colour_manual(values = c("#f5aca7","grey", "grey", "grey"))
b = ggparcoord(rbind(boston1[which(boston1$clus != 2),], boston1[which(boston1$clus == 2),]),
               columns=1:14, groupColumn="clus", scale="uniminmax") +
  xlab("") + ylab("") +
  theme(legend.position = "none") +
  scale_colour_manual(values = c("grey","#a8c75a", "grey", "grey"))
c = ggparcoord(rbind(boston1[which(boston1$clus != 3),], boston1[which(boston1$clus == 3),]),
               columns=1:14, groupColumn="clus", scale="uniminmax") +
  xlab("") + ylab("") +
  theme(legend.position = "none") +
  scale_colour_manual(values = c("grey","grey", "#1ec5c9", "grey"))
d = ggparcoord(rbind(boston1[which(boston1$clus != 4),], boston1[which(boston1$clus == 4),]),
               columns=1:14, groupColumn="clus", scale="uniminmax") +
  xlab("") + ylab("") +
  theme(legend.position = "none") +
  scale_colour_manual(values = c("grey","grey", "grey", "#cc8afd"))
grid.arrange(a, b, c, d)

The first plot could be useful when not enough space is available; however, with 4 clusters to display it quickly becomes hard to distinguish the categories, and the plot turns into a messy mix of colours.

The second approach, plotting each cluster individually, looks much cleaner than the others. However, the shapes are not the same as in the other two plots: plotting the clusters separately changes the minimum and maximum on each axis, restricting them to that cluster’s range. With a uniminmax scale this makes comparisons between clusters difficult; a globalminmax scale would be more appropriate for that kind of task.

The third option, plotting each cluster individually with the others in the background, is a clean way of visualizing the groups while still comparing their differences. It preserves the axis limits, and the outliers are much easier to spot. If I had to choose, this is how I would display my clustering results.